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^ Abstract 

^ , 
["t I , Genetic fitness optimization using small populations or small population updates 

^vq ' across generations generally suffers from randomly diverging evolutions. We pro- 

pose a notion of highly probable fitness optimization through feasible evolutionary 
\j\\ computing runs on small size populations. Based on rapidly mixing Markov chains, 



the approach pertains to most types of evolutionary genetic algorithms, genetic pro- 
gramming and the like. We establish that for systems having associated rapidly mix- 
Q . ing Markov chains and appropriate stationary distributions the new method finds 

optimal programs (individuals) with probability almost 1. To make the method use- 
ful would require a structured design methodology where the development of the 
^ . program and the guarantee of the rapidly mixing property go hand in hand. We an- 

f^ \ alyze a simple example to show that the method is implementable. More significant 

<^ • examples require theoretical advances, for example with respect to the Metropolis 

g : filter. 

o 



1 Introduction 



Performance analysis of genetic computing using unbounded or exponential 
population sizes or population updates across generations [27,21,25,28,29,22,8] 
may not be directly applicable to real practical problems where we always have 
to deal with a bounded (small) population size [9,23,26]. 

Considering small population sizes it is at once obvious that the size and 
constitution of the population or population updates may have a major impact 
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on the evolutionary development of the population. We aim to establish a fast 
feasible speed of convergence to a distribution of populations from which we 
can obtain by Monte Carlo sampling an optimal type individual with high 
probability. The method we propose clearly can be used by a wide range of 
genetic computing models which includes genetic algorithms on strings and 
genetic programming on trees, and so forth. Application of the method to 
problems is another matter; we have examples solving trivial problems but 
we don't have an example solving a difficult problem. The main question for 
future research is to supply such an application. 

The structure of the paper is as follows. In Section 2 we explain the finite 
Markov chain model for genetic processes. The states of the chain correspond 
to finite populations. The transition probability between two states is induced 
by the selection, reproduction, and fitness rules, as in [19,26,14]. 

Since the evolution from generation to generation is a random process, using 
finite populations different evolutions may diverge. This is not the case when 
we consider evolutions of probability density distributions. The idea is to view 
such processes as corresponding with infinite populations that are completely 
transformed in each generation. Or to view them as an approximation to very 
large populations with very large updates between generations or as an average 
of all possible evolutions from a finite population. Such evolutions considered 
by several authors as a convenient vehicle to analyze genetic processes are 
completely deterministic. In Section 3 we show that even if we view such a 
deterministic evolution as an "average" evolution, this average may behave 
very different from every particular real evolution. The crucial point here is 
how far a particular evolution (generally) strays from the average. We analyze 
the relation with the population size and the population update size. 

Under mild conditions that guarantee ergodicity the Markov chain converges 
to a stationary distribution over the set of states (the set of reachable popula- 
tions). From this stationary distribution we can sample a set of populations. 
If the total stationary probability concentrated on populations containing an 
individual of best fitness is large enough then this process finds such an in- 
dividual with high probability. For this approach to be workable we must 
have small enough populations and the convergence to the stationary distri- 
bution has to be fast. Convergence to stationarity is fast enough in "rapidly 
mixing" Markov chains. Such chains have recently been the basis of spectacu- 
lar randomized approximation algorithms, combinatorial counting, statistical 
physics, combinatorial optimization, and certain quadratic dynamic processes 
related to genetics of infinite populations, [20,4,7,22]. Section 4 introduces 
them in general genetic computing. The efficiency of our technique in applica- 
tions depends crucially on the rate of convergence of the Markov chain. Since 
the number of states is typically very large, the chain should reach equilibrium 
after each particular evolution has only explored a tiny fraction of the state 



space. 

For the theory of genetic computing it is important that we demonstrate a for- 
mal method of genetic fitness optimization (apphcable to restricted classes of 
GA, GP, and related optimization problems) together with a rigorous analysis 
demonstrating that this strategy is guaranteed to work with high probability, 
rather than intuitive heuristic or ad hoc arguments. For the application of 
genetic computing we find that because of the sampling from the stationary 
distribution the proposed process uses a large number of short runs as opposed 
to one long run. [^ 

Just to show that the method is meaningful we demonstrate it on a toy prob- 
lem in Section 5 that in fact is trivially successful because of the abundance 
of optimal solutions. Really significant examples are currently much harder — 
and already beyond the scope of this exploration. Further along is the devel- 
opment of a structured methodology to set up the genetic system (selection, 
reproduction, fitness) such that the resulting Markov chain is rapidly mixing, 
and, moreover, such that the types with sufficiently high fitness will be ob- 
tained with sufficiently high probability from the (close to) final stationary 
state distribution. What we have in mind is a design methodology to develop 
a genetic system satisfying these requirements from the specifications of the 
problem statement. 



2 The Model 



Assume that r is an upper bound on the number of different possible types 
of individuals, say a set fi = {0, . . . ,r — 1}. Such individuals can be strings, 
trees or whatever — our discussion is so general that the precise objects don't 
matter. Even if the set of types can grow (such as trees) then practically 
speaking there will still be an upper bound r. The genetic system tries to 
solve an optimization problem in the following sense. Each individual in Q 
is graded in terms of how well it solves the problem the genetic system is 
supposed to solve, expressed as a function / which maps Q to some grading 
set G. For example, G can be the real interval [0, 1]. Let f{u) be the fitness of 
type u. Then, the normalized fitness of individual u is 
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^ Prom a more applied perspective several researchers observed earlier that it pays 
to restart on a new population when the evolution takes a unpromising direction, 
for example [13,9]. Also J. Koza and L.J. Eshelman have algorithms that specifically 
restart automatically (GP, CHC, respectively), as do many others. 



To fix thoughts, we use fitness proportional selection where selection of in- 
dividuals from a population is according to probability related to the prod- 
uct of frequency of occurrence and fitness. That is, in a population P = 
(P(l), . . . , P(r)) of size n, where type u occurs with frequency P{u) > with 
J2u€nP{u) = n, we have probability p{u) to select individual u (with replace- 
ment) for the cross-over defined by 

p{u) - 
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It is convenient to formulate the generation of one population from another 
one as a Markov chain. Formally: 

Definition 1 A sequence of random variables (X()^q with outcomes in a 
finite state space T = {0, . . . , N—1} is a finite state time-homogeneous Markov 
chain if for every ordered pair i,j of states the quantity qij = Pr(Xi_|_i = 
j\Xt = i) called the transition probability from state i to state j , is independent 
oft. If Ai is a Markov chain then its associated tTansition matrix Q is defined 
as Q := {qij)f~2Q- The matrix Q is non-negative and stochastic, its row sums 
are all unity. 

Now let the Markov chain M. have states consisting of nonnegative integer 
r-vectors of which the individual entries sum up to the population size exactly 
n and let V denote the set of states of M.. The number of states A^ := jj=-V is 
given by [19] 




(This is the number of ways we can select r — 1 elements from n + r — 1 
elements. If the elements constitute a linear list and the r intervals marked by 
the selected elements — exlusive the selected elements — represents the elements 
of the r types the result follows directly.) The associated transition matrix 
Q = {qij) is a. N X N matrix where the entry qij is the probability that the 
kth generation will be Pj given that the (/c — l)st generation is Pi {Pi, Pj G V). 

A general closed form expression for transition probabilities for simple GA's 
is derived in [19] and its asymptotics to steady state distributions as popu- 
lation size increases is determined. In [14] it is observed that the mentioned 
closed form expression allows expression of 'expected waiting time until global 
optimum is encountered for the first time', 'expected waiting time for first op- 
timum within some error tolerance of global optimum', and 'variance in such 
measures from run to run', and so on, but no further analysis is provided. 
Instead, initial experimental work is reported. Here we are interested in quan- 
titative estimates of such expressions. 



Example 1 Consider a process where the generation of a next population 
P' from the current population P consists of sampling two individuals -u, v 
from P, removing these two individuals from P {P" := P — {u, v}), producing 
two new offspring w, z and inserting them in the population resulting in a 
population P' := P"[j{w, z}. 

The transition probability qp^p' of 

P^ P' 



where P' results from sequentially executing the program "P{u) := P{u) — 1; 
P(v) := P(v) - 1; P{w) := P{w) + 1; P{z) := P{z) + 1; P' := P," replacing 
the pair of individuals u, v hj w,z with {u, v} r\{w, z} = 0, is given by 

qp^P' := 2p{u)p{v)b{u, v, w, z), (2) 



where the local transition probability b(u, v, w, z) is the probability of producing 
the pair w,z from the selected pair -u, f, incorporating both the mutation 
probability and the cross-over probability. The r x r x r x r matrix B = 
{b{u,v,w,z)) is called the local transition matrix. We can generally obtain 
such transition probabilities between states P ^ P' of the Markov chain. O 



3 Evolutionary Trajectories 



Small population size or sample size may cause evolutionary trajectories to 
drift apart. Without loss of generality, this can be illustrated in a simplified 
setting ignoring fitness selection. 



3. 1 Transformation of Distributions 



Some approaches use the expedient to simply ignore the actual populations 
and deal with the probability density p(-) of types rather than with the number 
of occurrences P(-) of types in a population P. We show that the idea that the 
deterministic evolution is some sort of "average" of all evolutions of underlying 
populations has problems. Given a distribution density p and a local transition 
matrix B = {b{u,v,w,z)), let the transformation p' = g{p) be defined by 

P'(^) '■= J2 ( P{u)p{v) ^ 6(m, V, w, z) I , (3) 

u,v \ w / 



where B is such that p' is again a probabihty distribution. Consider a (not 
necessarily finite) population of individuals, each individual being of some type 
■uG{0,...,r — 1}. Let p{u) be the probability of selecting an individual of type 
u. When a pair of individuals of types u, v mate then they produce a pair of 
individuals of types w, z with probability b{u, v, w, z). Assuming that a mating 
of a pair must result in a pair of offspring means that Y.w,zb{u,v,w,z) = 1. 
The resulting probability of z is p'{z) in Equation 3. Then, 



'^p{u)p{v)b{u, V, w, z) = p'{w)p'{z) 

u,v 

Y.p'{w)p'{z) = 1. 



Probability density evolution has particular nice properties that can be demon- 
strated not to hold for population evolutions. In particular, probability density 
evolution converges. A distribution p is called an equilibrium distribution (with 
respect to transformation g) if g{p) = p. In [3,19] for simple GA with fitness 
selection, and [21] for more general quadratic dynamical systems but without 
fitness selection, the following convergence property is derived. 

Theorem 1 The sequence p^,p^, . . . with p* = g^{p^) (t > 0) converges to an 
equilibrium distribution lim^^oo P* = P- 

In certain infinite evolutionary models equivalent to the above transformation 
of probability densities the evolution develops deterministically according to 
Equation 3. But in practice things are different. Namely, the single evolution 
of the probability density may be very different from every evolution of a 
represented population. 

If the populations are small, or the population updates across successive gen- 
erations are small, then we are dealing with a random process and chance se- 
lections can cause great divergence of evolution of populations. In the practice 
of evolutionary computing this is always the case. We would like to quantify 
this. Primarily considering probability densities, neither [19] nor [21] explores 
in an explicit quantitative manner the divergence of trajectories of individ- 
ual runs based on population sizes. They rather focus on the issue that as 
the population size grows, the divergence of possible trajectories gets progres- 
sively smaller. In the limit, for infinite populations, the generations in the 
run converge to the expected trajectory for smaller populations. Clearly, if 
all trajectories are in a small envelope around the expected trajectory, then 
the expected trajectory is a good predictor for what happens with an indi- 
vidual run. If moreover the expected trajectory corresponds to the trajectory 
of Equation 3, as in the system analyzed in [19], then the analysis of trans- 
formation g tells us what to expect from our individual bounded population 



evolution. 

However, the expected trajectory can be completely different from, all individual 
trajectories; and if the individual trajectories of bounded populations diverge 
wildly, then the expected trajectory may not predict anything about what 
happens to an individual run. Analysis like in [19,21] do not deal with an 
individual run of a genetic algorithm, but rather with the sequence of expec- 
tations over all individual runs of the system. Such an expectation may not 
say anything about what actually happens. 

To see this, consider a dictatorial coin which gives a first outcome or 1 with 
fair odds. However, afterwards it always gives the same outcome. So it either 
produces an all run or an all 1 run with equal probabilities. The expectation 
of obtaining a at the tth trial is 1/2. However, in actual fact at the tth (t > 1) 
trial we have either probability 1 or probability for outcome 0. In terms of the 
above formalism, initially, p{0) = p{l) = 1/2. To express the "dictatorial coin" 
in terms of evolutionary processes and to analyze what happens we continue 
the above Markov chain termonology. 

For s G Af, the s-step transition matrix is the power Q* = (g|j) with gf^- = 
Pr(Xt_|_s = j\Xt = i), independent of t. Denote the distribution of Xt by the 
row vector tt* = (ttq, . . . , 7r^_i) with tt* = Pr(X( = i). If 7r° denotes the initial 
distribution then tt* = tt^Q* for all t G A/". Often we have 7r° = 1 for some i 
(and elsewhere) in which case i is called the initial state. 

Definition 2 The chain is ergodic if there exists a distribution vr over V 
with strictly positive probabilities such that 

lim p? ■ = TT,-, 



for all Pi,Pj G V. In this case we have that tt* = tt^Q* -^ n pointwise as 
t -^ oo, and the limit is independent o/tt". The stationary distribution tt is 
the unique vector satisfying ttQ = n, where J2i '^i = 1/ that is, the unique 
normalized left eigenvector of Q with eigenvalue 1. Necessary and sufficient 
conditions for ergodicity are that the chain should be irreducible, for each pair 
of states Pi,Pj G V there is an s E M such that p? • > (Pj can be reached 
from Pi in a finite number of steps); and aperiodic, the gcd{s : pfj > 0} = 1 
for all Pi, Pj e V . 

An ergodic Markov chain is (time-) reversible iff either (and hence both) of the 
following equivalent conditions hold. 

• For all Pi, Pj eV we have PijUi = Pj.i^fj- That is, in the stationary distribu- 
tion, the expected number of transitions per unit time from state Pi to state 
Pj and from state Pj to state Pi are equal. For an ergodic chain, if tt is a 
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Fig. 1. Dictatorial ^-transformation 

positive vector satisfying above condition and the normalization condition 
X^jTTj = 1, then the chain is reversible and vr is its stationary distribution. 
The matrix D^I'^QD"^!'^ is symmetric, where D^/^ is the diagonal matrix 



diag(7r, 



1/2 

5 



1/2 



_^) and D ^^"^ is its inverse. 



Example 2 We can formulate a "dictatorial coin" example in the evolution- 
ary format of Equation 3. Let the B transformation be given by Figure 1 [^ 
where < e < |. The evolution of the probability densities by transformation 
of the distributions according to Equation 3 gives 



p -^ p 



p 



with p'{0) = p'{l) = 1/2 (= p(0) = p(l)) by symmetry between "0" and "1." 
But if we look at what the system does in actual evolutions then the following 
happens. Consider a state space V consisting of the two-element populations: 
Pq = {0, 1}, Pi = {0, 0} and P2 = {1, 1}. To pass from one generation to the 
next one, with uniform probability and with replacement draw two elements 
from the current population. Subsequently, execute a cross-over according to 
the B matrix. The resulting two individuals form the next generation. For 
example, from a population Pq = {0, 1} we obtain as the next generation 

• with probability 1/4 a population Pi = {0,0} with pi(0) = l,pi(l) = 0; 

• with probability 1/4 a population P2 = {1,1} with p2(0) = 0,p2(l) = 1; 
and 

• with probability 1/2 a population P3 = {0, 1}(= Pq) withp3(0) = 1/2, ^3(1) = 
1/2. 



'^ This B gives rise to an ergodic Markov chain, has strictly positive entries, and 
it satisfies b{u,v,w,z) = b{v,u,w,z) = b{u,v, z,w) = b{v,u, z,w) and hence is 
symmetric in the sense of [21]. 



The associated Markov chain of this process is given by the matrix 



Q:- 



/i 1 1 \ 

2 4 4 

2e 1 - 4e 2e 
y2e 2e 1 - 4e j 



where the entry qij gives the transition probabihty of going from state (pop- 
ulation) Pi to Pj, 0<i,j <2. 

Denote the probabihty of being in state Pi after t steps by vr*, and tt* = 
(tTqjTT^, TTg). Since the Markov chain is ergodic (or by simple inspection) the 
pointwise limit linii^oo vr* ^ tt exists where tt is the stationary distribution. 
Solving TiQ = IT gives 



4e 
TTn = > for e ^ 

" l + 4e 

TTi = > - for e ^ 

2 + 8e 2 

no = > - for e ^ 

2 + 8e 2 

In fact, starting from population Pq after t steps and with e small enough to 
satisfy t <^ — log e we will be in population Pq with probability ~ 1/2*, and in 
both population Pi and population P2 with probability ~ 1/2(1 — 1/2*). Once 
we are in population Pi or P2 we stay in that population at the next generation 
with probability 1 — 4e, for small e > this is almost surely. Therefore, the 
evolution from Pq will quickly settle in either Pi or P2 and henceforth remain 
there for a long time. 



Additionally we observe that for e = 1/8 we have that Q is equal to its 

'1 1 1^ 

^3' 3' 3' 



transpose Q^ and the stationary distribution n = {\AA) and therefore the 



chain is reversible. 



3.2 Finite Population: Large Sample 



Assume that we have a finite population P with probability density p{u) of 
drawing individual u in the selection phase. Assume furthermore that in the 
selection phase we draw a sample of cardinality s. The larger s is the better 
we can approximate p by the resulting frequencies. Quantitatively this works 
out as follows. 

Let there be r types of individuals in Q and let s{u,v) be an outcome of the 



random variable measuring the number of outcomes of the pair {u, v) in s 
trials. By Chernoff' s bound, see for example [15], 

2 e^s 

Vi {\s{u,v) — p{u)p{v)s\ > es} < — with a 



e" 3p{u)p{v) 

Let p'{) be the next probability distribution as defined in Equation 3, and 
let p'{) be the frequency distribution we obtain on the basis of the outcome 
s{u,v) in drawing s examples. For our further considerations the dependence 
of a on u,v is problematic. It is convenient to replace a = a{u,v) by an a' 
independent of u, v and a' < e^s/3 < a{u, v). Then, 

2 

Pi {\s{u,v) — p{u)p{v)s\ > es} < — j 



for every u,v & Q. This gives the probability that we exceed the value es for 
one pair (m, v). The probability that we exceed the value es for some pair (-u, v) 
is upper bounded by 2r^/e° . Hence the probability that we do not exceed the 
value es for any pair {u, v) is at least 1 — (2r^/e" ). We can now conclude that 
for every 2; G fi, the absolute error oi the estimate p'{z) is 



\p\z)-p\z)\<Y, 



p[u)p[v) 

s 

< e ^ 6(m, f , w, z) = er. 



Y,b{u,v,u!,z) 



with probability at least 1 — 2r^/e"'. For example, choose e = l/s^^^ and 
sample size s with s^^^ > 3p{u)p{v) to ensure that e^s/3 > a' > s^^'^, for all 
u,v & Q. (The upper bound of a' was required by the relation between a' and 
a.) 

Then, for all types z G f2, for s^^^ > r 

r 1 1 2r^ 2s^/'* 

Pr [ip'iz) -p'iz)\ < ^} > 1 - ^(> 1 - ^- 



That is, for growing s the probability that the estimator p'{z) differs from the 
real p'{z) by more than 1/s^/® decreases as e~^ . We call such a sample large 
because it is polynomial in the number of types r which typically means that 
the sample is exponential in the problem parameter / (as when Q = {0, 1}'). 
It is possible to estimate a boundary envelope on the evolution trajectories 



^ Cubic results appearing in [10,9] are cubic in the population size n and refer to 
different issues. 
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around the single evolution trajectory determined by Equation 3 as a function 
of the sample size. This may be the subject of a future paper. 



3.3 Finite Population: Small Sample 



Consider a population P with types out of Vt with associated probability den- 
sity p(-) of types over Vt and local transition matrix B = {b{u,v,w, z)). We 
generate the next population by drawing a small sample consisting of one 
pair u, V of individuals from P (without replacement) according to probability 
density p and replace u,v in P by a pair w, z with probability 6(m, f , w, z) to 
obtain population P' with associated probability density p'(-)- 

In a concrete computational run of a genetic algorithm this means that given 
a population P with associated density distribution p(-) we obtain with proba- 
bility p{u)p{v)b{u, V, w, z) a distribution p'(-) being the associated probability 
density of the population P' resulting from eliminating a pair of individuals 
M, V from P and adding a pair of individuals w, z as in Example 1. Start with 
a population of size n containing VL{n) different types and let each types have 
positive probability to be selected to produce the next generation. Then there 
are VL{n'^) different distributions that can be obtained from p(-) this way (by 
Equation 3). 

Repeating this procedure, we potentially obtain in t steps up to n"^* distribu- 
tions. For example, if _B = {b{u, v, w, z)) is strictly positive and the associated 
Markov chain is ergodic then this means that in 

logA^ 
to" 



41ogra 



generations we can possibly realize the total range of all A^ different popula- 
tions and therefore all distributions p'{-)- Every population P G P is obtained 
with some probability in to generations. The single deterministic evolution of 
to generations of probability distributions 



p = p^ ^ p^ ^ ■ ■ ■ ^ p*° 



according to Equation 3 gives the expectation p^° {z) of an individual of type z in 



^ In a more restricted setting of a quadratic cross-over system with Q = {0, 1}' 
reference [22] shows that the probability distribution of an infinite quadratic cross- 
over system (without fitness selection) stays for the duration of an evolution of t 
generations in an appropriate sense close to that of a population of size 0{n'^t) 
initially drawn randomly from the infinite population. 
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the toth generation, but it does not say anything about the actual probabihty 
density p^'^{h) in the tgth generation of an actual evolution. 



4 Towards a Discipline of Evolutionary Programming 



The upshot of the considerations so far is that with limited size populations 
and population updates the variation in evolutions is very great. In practice 
we always deal with very limited size populations such as, say, 500 individuals. 
The question arises how to overcome the problem that an individual evolution 
can become trapped in an undesirable niche — as in Example 2 — for example 
a niche consisting of populations with non-optimal individuals. The answer 
is that we need to randomize over the evolutions. Inspecting the populations 
in such a random sample of evolutions we want to find, almost surely, an 
individual of best fitness. The latter is easy if the set of inspected evolutions is 
so large that it covers almost all populations. But this is infeasible in general. 
Let us look at two easy tricks that point the way we have to go. 

Example 3 (Using the Law of Large Numbers) Consider an ergodic 
Markov chain associated with an evolutionary process. Using the law of large 
numbers, Ct{P)/t -^ 7r(P) as t ^ oo almost surely, where Ct{P) is the number 
of occurrences of population P in the first t generations, and 7t{P) is the 
stationary probability of P. Therefore, in the same manner, it is easy to show 
J2p(^p* Ct{P)/t — > n(V*) almost surely where V* is the set of populations that 
include an individual i* with the best fitness value, and 7r(P*) is the stationary 
probability that a population includes i*. 

But this approach doesn't give a speed of convergence guarantee. What we 
actually want is an approach that the expected time for an element of V* to 
show up is polynomial. One way to formulate a sufficient condition for this is 
that we guaranty that for all parameters e,6 > the probability 



Pr 



PeV* ^ 



>e <5, (4) 



with t polynomial in the problem parameter / (like the length of the individ- 
uals), 1/e and 1/5. Roughly speaking, this is achieved by "rapidly mixing" 
processes below. O 

Example 4 (Probability Boosting) AsM.O. Rabin and others have ob- 
served, the power of randomization over deterministic algorithms is that it can 
solve problems with probability almost 1 by repeated independent runs of the 
algorithm, provided each single run has probability of error appropriately less 
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than 1 (see the text [18]). A direct apphcation of probabihty boosting to evo- 
lutionary computing, [24], is as follows. Let T be a random variable defined 
as the first-case hit for an optimal solution by a randomized algorithm like a 
GA. Let the expectation satisfy E(T) < T where the upper bound T is a poly- 
nomial in the problem dimension / (like the length of the individuals). Now if 
the randomized algorithm is stopped after t > 2T steps then the best solution 
found so far may not be the globally optimal solution. The probability that it 
is not is P{T > t) which by Markov's inequality satisfies 

P(T > t) < ^^ < i. 
^ ' - t - 2 



After k independent runs (with independently random initial conditions) the 
probability that the global solution is found at least once is greater or equal 
to 1 - 1/2^ 

For this observation to be useful we must show that in the case of interest 
the expected running time up to first-case hitting time of an optimal solution 
(or approximately optimal solution) is polynomial in the problem dimension. 
In fact, it suffices if this is the case with respect to only a subset of the 
computations of appropriate positive probability, like in Equation 4. O 



4-1 Rapidly Mixing Markov Chains 



We follow the exposition in [20]. Given an ergodic Markov chain, consider 
the problem of sampling elements from the state space, assumed very large, 
according to the stationary distribution n. The desired distribution can be 
realized by picking an arbitrary initial state and simulating the transitions 
of the Markov chain according to probabilities Pij, which we assume can be 
computed locally as required. As the number t of simulated steps increases, the 
distribution of the random variable Xt will approach tt. The rate of approach 
to stationarity can be expressed in the following time- dependent measure of 
deviation from the limit. For every non-empty subset U ^ V, the relative 
pointwise distance (r.p.d.) over U after t steps is given by 



Au(t) = max 



Kj-^il 



i,j&U TTj 



This way, Aij{t) is the largest relative distance between tt* and tt at a state 
Pj G U, maximized over all possible states in U. The parameter U allows us 
to specify relevant portions of the state space. In case U = V we will omit the 
subscript and write A instead of A(/. 
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The stationary distribution n of an ergodic chain is the left eigenvector of Q 
with associated eigenvalue Aq = 1. Let Ai, . . . , Aat-i with Aj G C (the complex 
numbers) be the remaining eigenvalues (not necessarily distinct) of Q. By 
the standard Perron-Frobenius theory for non-negative matrices these satisfy 
|Ai| < 1 for 1 < ? < A^ — 1. The transient behavior of the chain, and hence its 
rate of convergence, is governed by the magnitude of the eigenvalues Aj. In the 
reversible case, the second characterization above implies that the eigenvalues 
of Q are those of the symmetric matrix D^^^QD~^^^ and so are all real. This 
leads to the following clean formulation of above dependence: 

Lemma 1 Let Q be the transition matrix of an ergodic reversible Markov 
chain, vr is stationary distribution, and Aq = 1, . . . , Aat.i its (necessarily real) 
eigenvalues. Then, for every nonempty subset U (^V and allt G J\f the relative 
pointswise distance over U satisfies 



Ac/(t) < 



^max 



minp^gf/TTj' 



where Amax is the largest value in |Ai|, . . . , |Ajv_i|. 

Lemma 2 With the notation of Lemma 1 the relative pointswise distance over 
V satisfies 



A(t) > A 



t 

max 



for every even t G N . Moreover, if all eigenvalues of Q are non-negative, then 
the bound holds for all t G N . 

Therefore, provided tt is not extremely small in some state of interest, the 
convergence of the reversible chain will be rapid iff Amax is suitably bounded 
away from 1. Such a chain is called rapid mixing. 

If we order the eigenvalues 1 = Aq > Ai > ■ ■ ■ > \n-i > —1 then Amax = 
max{Ai, IAat-iI} and the value of Aat.i is significant only if some eigenvalues 
are negative. The oscillatory behavior associated with negative eigenvalues 
cannot occur if each state is equipped with sufficiently large self-loop prob- 
ability. It is enough to have min^ g^-j > 1/2. To see this, let I^ denote the 
N X N identity matrix and consider the non-negative matrix 2Q — Jjv, whose 
eigenvalues are /ij = 2Aj — 1. By Perron-Frobenius, /ij > — 1 for alii E V which 
implies that Aat.i > 0. 

What do we do when we have negative eigenvalues? To be able to apply 
Lemma 2 without oscillations we require all eigenvalues to be positive. It 
turns out that we there is a simple modification of the chain with negative 
eigenvalues that turns it into a chain with only positive eigenvalues without 
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slowing down the convergence to stationarity too much. We simply increase 
the self- loop probability of every state by | after halving it first: 

Lemma 3 With the notation of Lemma 1, let the eigenvalues of Q he ordered 
1 = Aq > Ai > • ■ ■ > A^v-i > —1. Then the modified chain with transition 
matrix Q' = ^{In+Q), with I^ as above, is also ergodic and reversible with the 
same stationary distribution, and its eigenvalues A'j similarly ordered satisfy 
\'n-i > and A'^,, = A; = i(l + Ai). 

Following [20] we define rapid mixing. 

Definition 3 Given a family of ergodic Markov chains Ai (x) parametrized 
on strings x over a given alphabet. For each such x, let A^^^t) denote the 
r.p.d. of Ai{x) over its entire state space after t steps, and define the function 
T^^^ (e) from the positive reals to the natural numbers by 

r(")(e) = min{t : A^''\t') < t for all t' >t}. 



We call such a family rapidly mixing iff there exist a polynomial bounded 
function q such that T^^\e) < g(|a;|, loge^-*^) for all x and < e < 1. 

In the applications to evolutionary programming, x will be a problem instance 
and the state space of A^ {x) will include solution sets R{x) of some relation 
R. 

The question arises whether the approach to rapidly mixing Markov chains 
can be generalized from reversible chains to non-reversible chains. This was 
affirmatively settled in [17] and another treatment was later given in [7]. See 
the short discussion in [20]. 

Example 5 To compute the permanent of a dense matrix is ^^P-complete. 
The permanent of an n x n matrix A with 0-1 entries aij is defined by 



per A := XI n' 



ra-l. 



i=0 "'i,(y{i)) 



where the sum is over all permutations of the set {0, . . . ,n}. Since the class 
of T^P-complete decision problems includes the class of NP-complete decision 
problems, computing the permanent is at least NP-hard. 

A celebrated result of Jerrum and Sinclair [11] shows how to use rapidly mix- 
ing Markov chains to obtain a randomized algorithm that approximates the 
value of the permanent of a matrix A within ratio 1 + e with probability at 
least 3/4 in time polynomial m\A\ and | l/e| where | ■ | denotes the length of the 
binary representation. By probability boosting, we can by 0(\og5) iterations 
boost the success probability to at least 1 — 5. This breakthrough result has led 
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to a "Markov Chain Renaissance" to employ rapidly mixing Markov chains 
to obtain such "fully polynomial randomized approximation schemes (fpras)" 
to hard problems in computer science [5,20,1]. These applications generate a 
uniform stationary distribution from which the approximation is obtained by 
Monte Carlo sampling of the states and determining the proportion of the 
successful states. In our application to genetic computing we proceed differ- 
ently: with high probability we sample states containing a best fit individual 
(or an approximately best fit individual). We illustrate the idea by example 
in Section 5. O 



4-2 Optimization by Rapidly Mixing Evolutionary Algorithms 



To optimize by rapidly mixing evolutionary algorithms we require two prop- 
erties: 

(1) The stationary distribution tt of populations V of the associated Markov 
chain of the evolutionary process converges to concentrate a sufficient 
amount of probability on populations containing maximally fit individu- 
als, or on populations containing individuals that enable us to compute 
the required solutions. That is, 

PiGP* 

where V* is the set of populations containing at least one solution of best 
fitness (or a solution that approximates the global optimum, or solutions 
that enable us to compute the required solution or approximation) and ttj 
is the stationary probability of population Pj. To ensure feasibility of the 
algorithm we customarily require that 1/e is polynomial in the problem 
parameter. 

(2) The Markov chain of the evolutionary process converges sufficiently fast 
to the stationary distribution: it is rapidly mixing as in Section 4.1. 

The rapid mixing property (2) can be satisfied by having the evolutionary 
system satisfy some structural properties. Such properties can, at least in 
principle (if not in practice), always be taken care of while implementing the 
evolutionary system by choosing the selection rules, cross-over operator, and 
mutation rules appropriately. These requirements are covered in Section 4.3. 

The question of probability concentration, property (1), is more subtle, and 
it is not yet clear how to generally go about it, even in principle. In many if 
not most cases we are satisfied to obtain an approximately globally optimal 
solution. 
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4-3 A Discipline of Evolutionary Programming 



For a structural discipline of evolutionary programming we need to develop a 
methodology that given a problem specification guides us to construct an evo- 
lutionary system such that the associated Markov chain satisfies the following 
requirements: 

Property 1 the second largest eigenvaluS^ Amax is suitably bounded away 
far enough from 1 so that the Markov chain is rapidly mixing (Definition 3 of 
Section 4-l)i o.^'^d 

Property 2 the stationary distribution vr gives probability greater than e, 
where 1/e is polynomial in the problem parameter, to the set of states that 
contain individuals of best fitness. 

For Property 1 it is required that the matrices are (i) irreducible, and (ii) 
have nonnegative entries. Since the only matrices we consider are stochastic 
where the entries are transition probabilities, (ii) is in our case easy to satisfy 
up to the 'suitable' condition in Property 1. Since we only deal with ergodic 
matrices, and (i) is required for ergodicity. Property 1 is always satisfied in 
our case. Ergodicity is immediate if we have a positive mutation probability 
of transforming i into j for each pair of types i,j. Hence by proper choice 
of the genetic system leading to suitable transition probabihties inducing a 
rapidly mixing Markov chain one can satisfy Property 1 in construction of 
an evolutionary system. It is perhaps less easy to see whether it is feasible 
to satisfy Property 2 in each particular case, or indeed without knowing the 
optimal individual a priori. However, as discussed above a similar approach for 
approximating very hard combinatorial optimization problems, [20], worked 
out fine. 

Assume that we have defined our evolutionary system satisfying Properties 1, 
2. The program we use is then as follows. Repeat a polynomial number of 
times: 

Step 1: From a start state evolve through a polynomial number of genera- 
tions; 
Step 2: From the final population vector select the fittest individual. 

Paradigm. Running the program longer than a polynomial number of gen- 
erations will not significantly change the closeness of the state distribution to 
the stationary distribution in the Markov chain. We can only guarantee that 
we find a state (vector) containing an optimal fit individual with probability 



^ The second largest eigenvalue was used earlier in genetic computing to advantage 
for another purpose in [26]. 
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say inversely polynomial in the problem parameter. However, polynomially 
repeating this procedure implies Monte Carlo sampling which almost surely 
discovers the individual with optimal fitness. 



5 A Toy Rapidly Mixing Genetic Algorithm 



Consider a toy evolutionary problem as follows. We consider a population 
of size yl and very simple crossover only and some mutation. This example 
already illustrates adequately the rapid mixing phenomenon. The genetic al- 
gorithm G is defined as follows. The set of all program types is fi = {0, 1}' 
with / fixed, even, and large enough for the following analysis to hold. The 
fitness of a program u; G fi with uj = ujiuj2 .. .uji is given by the function 

I 
f{(jj) = 1 if ^o;, = //2, and 1/2 otherwise . 



The starting population P° at time to = contains yl copies of the individual 
00 ... 0; its cardinality (number of elements in P") is yl. We express the 
frequency of a string cj in a population P by #(^(P). That is, #oo...o(-P°) = v^ 
and #^(P°) = Oforcj^OO...O 

The transition of one population to the next generation (population) is as 
follows. To avoid problems of periodicity, we add self-loop probability of 1/2 
to each state (that is, population). Note that this also dispenses with the 
problem of negative eigenvalues. Consequently, there is probability 1/2 that 
the state changes using crossover and mutation, and there is probability 1/2 
that it stays the same. The probability p{uj) of selecting a string u from a 
population P is 

#„(p)/(^) ,,, 



In the selection phase we select two individuals in P, say uj'^,uj\ according 
to these probabilities, and with probability 1/2 we perform a crossover and 
mutation on each (and with probability 1/2 we do nothing). The crossover 
operator interchanges a single bit of a;* with the corresponding bit of uK It 
selects the single bit position with uniform probability l/l. Subsequently, we 
mutate each offspring by flipping a single bit with uniform probability l/l 
chosen from the positions 1 through /. (If i = j then the cross-over doesn't 
do anything and the two mutations may result in 0,1, or 2 bit flips of Ui.) We 
flrst prove that G is rapid mixing by showing that if the following system G' 
is rapidly mixing then so is G. 



Let G' be a system where the initial state is a binary /-vector. At each step uni- 
formly at random select a bit position of the current /-vector and flip that bit 
with fifty-fifty probability to produce the next /-vector. Then G' is a Markov 
chain where the states are the binary /-vectors. 

Lemma 4 The chain G' is rapid mixing with r.p.d. at most e within 0(1'^ {I + 
log(l/e))) steps. 

For a proof see [20], pp. 63-66. This system is an almost uniform generator for 
Q, using singleton populations, where it suffices to use an arbitrary starting 
singleton population. In terms of GA's it is single-bit mutation. Our example 
involves single-bit mutation, single-bit cross-over, and selection. The reader is 
advised that this is only a cosmetic change to make the example look more 
like a 'realistic' GA. Our toy example G is essentially the example G' as in 
Lemma 4. To see this, consider the vectors in successive generations P°, P^, . . . 
of G to maintain their identity. If P* = {co^'^, . . . , c<j*'^'} for t > and in the 
selection phase we select indices i,j, then u;*+^'^ = uj^''^ ioi < k < \^ and 
k 7^ i,j, or uj^~^^'^ results from u*'^ (the 'same vector') by at most two bit flips 
for h = i,j. 

Lemma 5 Lete>0 andT{l) = 0(/^/2(/ + log(l/e))). For eacht> T{1), with 
probability at least 1 — 1/T(/) and for each l-vector u, every l-vector u^'^ G P" 
has probability (1 ±e)/2' of being changed into u*'^ = u int generations of G. 

Proof. For a fraction of at least 1 — 1/t of all runs oi t > yl steps of a 
population of \/l elements, then each element j out of 1, . . . , v^ (representing 
the vector u''^) is selected with frequency of at least 



±0(,ii^) (6) 



2^^l V Vl 



in the selection phases of the generating process. This is shown similar to 
the statistical analysis of 'block frequencies' of high Kolmogorov complexity 
strings in [15], Theorem 2.15. 

Namely, consider t throws of a v/-sided coin, each pair of throws constituting 
the selection of the indexes of the two individuals mated to produce the next 
generation. There are 2*^*'°^')/^ possible sequences a; of t outcomes. Hence, the 
maximal Kolmogorov complexity is given by C(a;|t,/) < (tlog/)/2 -|- 0(1). 
Moreover, since there are only 2*^*'°s')/^/t binary descriptions of length < 
(tlog/)/2 — logt + 0(1), there is a fraction of at least 1 — 1/tth part of all 
sequences x which has G{x\t,l) > (tlog/)/2 — logt-l-O(l). Consider each such 
X as a binary string consisting of blocks of length log yl, each block encoding 
one of the v/ types. Let #j(a;) denote the number of occurrences of each of 
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the v/ blocks j (elementary outcomes) in x. Then, by [15] p. 163, 

l#j(x)-tMi< 



log Vl + log log Vl + logt + 0(l) g^ 



\ V/lo, 



ige 



Since individuals have fitness 1/2 or 1, some indexes may at various times 
have as low as half the probability of being selected than other individuals. 
Repeating the same argument for an 2v/-sided coin and represent by the first 
yl outcomes for indexes 1 through v/ and the remaining outcomes represent 
dummy indexes (possibly the original ones) we obtain the lower bound of 
Equation 6. 

Following the same vector in the successive generations, consider each time 
it is selected. At such times, with fifty-fifty probability either nothing is done 
or the vector incurs (i) a bit flip in a position which was selected uniformly 
at random because of the cross-over (or no bit flip if the bits in that position 
of the two parents happened to be the same), followed by (ii) a bit flip in 
a position selected uniformly at random because of the mutation. From the 
viewpoint of the individual vector and the mutation operations alone it simply 
emulates a trajectory of the singleton /-vector in Lemma 4 of length as given 
in Equation 6. The extra random bit flips due to the cross-over only increase 
the length of the emulation. 

Substitute t in Equation 6 by T{1) as in the statement of the lemma. By 
Lemma 4 the lemma is proven. □ 

Let uj be an /-vector. For every e > and t > T(/), every /-vector in the initial 
population P° turns into u in exactly t steps with probability at least (1 — 
l/t)(l ± e)/2'. Therefore, P° generates in t steps every particular population 
P of v/ individuals with probability 

{l-\{l±e))/N, 

where A^ is the number of v/-size populations. Then, the r.p.d. of G to the 
uniform stationary distribution vr (7r(P) = 1/A^ for all P G {0, 1}' with 
7^(P) = v/) after t > T{1) steps is bounded above by ^(1 + e). Choosing 
t > max{T(/), 1/e} + 1 the r.p.d. is upper bounded by e. 

Corollary 1 It follows that G is a rapidly mixing Markov Chain with a 
uniform stationary distribution. 

Lemma 6 The probability of finding a population with an optimally fit element 
in t runs is at least 1 — 2e^"* with a = c/(16(l — c)), for the fixed constant c 
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given in Equation 7. 



Proof. There are L,^] ~ 2'' / JtxI/2 strings with fitness 1. Hence a fraction of 
at most 



(1 - 1/^^)^ < e-v^ 

populations of size v/ contain no such strings. This means that a constant 
fraction of at least 



c=l-e-V^/-, (7) 

of the populations of size yl contain at least one string of fitness 1. 

Consider each run of T{1) generations an experiment with a success outcome 
if the final population contains an individual with fitness 1. Let the number 
of successes in t trials be s{t). Then, with P defined as 

(3 = FT{\s{t)-ct\ >6t} 
we have 

by Chernoff's bound. For 6 = c/2 we know that the number of successes 
s{t) > with probability at least 1 — j3. □ 

Theorem 2 ((Rapidly Mixing GA Algorithm)) Let e and T{1) he as 
in Lemma 5 and let a he as in Lemma 6. Repeat t times: run G for T{1) 
generations. This procedure uses 0{T{1) -t) elementary steps consisting of the 
generation from one population to the next population. (With t = I this is a low 
degree polynomial in I and e). The prohahility of finding an optimal element 
exceeds 

1 - 2e-"*, 

where a > 0, that is, with prohahility of failure which vanishes exponentially 
fast with rising t. 

Proof. By Lemmas 5, 6. □ 



6 Non-uniform Stationary Distributions 



In the above example the stationary distribution is uniform and success of 
the method depends on the abundance of populations containing an optimal 
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individual. However, we want the stationary distribution of populations to 
heavily concentrate probability on populations containing optimal or near- 
optimal individuals even if those populations are scarce. For example, if our 
fitness function is / : fi — > A/" and we extend / to populations P with f{P) = 
maxtjgp{/(a;)} then we want to generate a random element from a distribution 
concentrated on the set of optimum solutions. This is similar to generating 
a random P from a distribution vr where 7r(P) = 0(2-^*^^)/") with a a small 
positive number. Then, with large probability a random P will maximize /. A 
general method to modify a random walk so that it converges to an arbitrary 
prescribed probability distribution is the Metropolis filter, [16]. Let's explain 
a simple example of this. Suppose Q = {0, 1}', we are dealing with singleton 
populations, and our fitness function is /. We describe a random walk on Q 
by single bit-flips and "filtered" by the function /. The next population is 
generated from the current population {u} as follows. First select a random 
bit position in u. Let u' be the string resulting from flipping that bit of u If 
f{uj') > fiuj) then the next generation is {cj'}; otherwise the next population is 
{uo'} with probability f{u')/f{uj) and the next population is uj with probability 
1 — f {uj') / f {(jj) . Clearly this modified random walk is a Markov chain (and it 
is also time- reversible) . The stationary distribution tx^ is 



Tl^ {uj) 



/(^) 



E.'en/(^') 



i2 



For example, with f{uj) = 2* where i is the number of I's in cj the optimal 
individual is 11 ... 1 which is sampled from the stationary distribution with 
high probability. Unfortunately, it is not in general known how to estimate 
the mixing time of a Metropolis-filtered random walk. On the positive side, 
in [2] they compute a volume in n-space using this method and they show 
that the filtered walk mixes essentially as fast as the corresponding unfiltered 
walk. A similar approach to combinatorial optimization using the Markov 
chain Monte Carlo method in the sense of a Metropolis process-type Markov 
chain having a stationary distribution that concentrates high probability on 
the optimal (or approximately optimal) solutions is surveyed in [12]. They 
give a polynomial time Metropolis process to find an approximate maximum 
matching in arbitrary graphs with high probability. More precisely, if G is an 
arbitrary graph on n vertices then the algorithm finds a matching in G of 
size at least [(1 — e)ko\ where /cq is the size of the maximal matching and e 
is an accuracy parameter which is assumed to be constant — the running time 
is actually exponential in 1/e. However, these successes are scarce. For the 
current status and references on Metropolis algorithms see [6]. 
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7 Conclusion and Further Research 



We have suggested a theoretical possibility of constructing genetic processes 
that provably optimize an objective function with high probability in polyno- 
mial time. We have given a simple example that, however, succeeds because of 
the abundance of optimal solutions. Altogether it seems difficult at this time 
to even construct an example of a genetic process that is both rapidly mixing 
and also has a nonuniform stationary distribution that heavily concentrates 
probability on populations containing optimal individuals in case such popu- 
lations are scarce. An example of this would give evidence of the power of the 
proposed method. 
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